Abstract

Issues of equity and fairness across subgroups of the population (e.g., gender or ethnicity) must 
be seriously considered in any standardized testing program. For this reason, many testing 
programs require some means for assessing test characteristics, such as reliability, for subgroups 
of the population. However, often only small sample sizes are available for the subgroups of 
interest. Traditionally used reliability estimates (e.g., Cronbach’s alpha) can have low precision 
for small samples. This study investigated whether an empirical Bayes (EB) technique could 
produce more precise reliability estimates than traditional methods in the presence of small 
samples. Several Bayesian estimates were compared to estimates obtained by other methods 
(e.g., the traditionally and currently used Cronbach’s alpha coefficient), in terms of both bias and 
variance. A secondary purpose of this study was to compare the various EB approaches across 
different sample sizes. This paper also discusses EB estimates of standard error of measurement 
(SEM), their accuracy and precision, and how they compare with SEM estimates derived from 
the alpha. 

Key words: Empirical Bayes, reliability, standard error of measurement, prior distribution 
Overview and Background

In an educational environment increasingly influenced by standardized testing, issues of 
equity and fairness across subgroups of the population (e.g., defined by ethnicity or disability 
status) must be seriously considered. An adequate assessment of the characteristics of a test in 
these subgroups becomes essential. Failure to assess the reliability of a test for a subgroup might 
result in the use of an unreliable instrument in high-stakes decision making (e.g., program 
admission or class placement). Alternatively, imprecise measurement of reliability for subgroups 
might result in the exclusion of a beneficial test from consideration as an assessment tool. Either 
one of these possibilities could have highly detrimental effects (e.g., increased social barriers for 
already disadvantaged groups). Many areas of this important issue remain to be addressed. 

Although estimation of subgroup reliability might seem to be a straightforward or even 
trivial matter, it presents several problems. Most notable is that the subsamples from the 
population may be quite small. Small sample sizes necessarily affect the 
quality of the reliability estimates. Walker and Zhang (2004) suggested a minimum sample size 
of 125 to 150 for calculating reliability, with at least as many people in the sample as items on 
the test. Another problem involves score range restrictions as a result of subsampling. Reliability 
of a test can be affected by changes in group heterogeneity or by systematic selection of scores 
(e.g., subsampling) because observed variances can be different in the selected subgroup and 
total group (Allen & Yen, 1979). These score range restrictions can attenuate the estimate of 
reliability, possibly leading to erroneous conclusions about the adequacy of the test for the 
subpopulation in question. 

Accordingly, research on estimating reliability with small samples is necessary. In the 
current study, empirical Bayes (EB) techniques were applied to the estimation procedure, 
integrating collateral information (e.g., reliability information on the same test from different 
subpopulations) to improve the accuracy of reliability estimates. This study investigated whether 
the EB-based reliability estimates improved the precision for estimating reliability of subgroups 
of a population, even for very small subsamples. It also compared the various EB approaches 
across different sample sizes. 

No studies to our knowledge directly employ EB techniques to estimate reliability, 
although the use of EB techniques to improve the accuracy of estimates has a long history. 
Several researchers have investigated the use of EB techniques. Braun and Jones (1984) studied 
the validity of academic predictors of graduate school performance using EB methods and found 
that the EB coefficients provided a useful way of combining information across subsamples. The 
researchers reported that the EB models not only yielded better predictions of first-year grade 
averages than cluster analysis estimates, but also facilitated the accurate assessment of the 
quality of these predictions. Moreover, the prediction equations were quite stable and rarely 
displayed implausible features such as negative weights (Braun & Jones, 1984). As Braun and Jones 
combined information across subsamples, the current study also used reliability information 
across subsamples to estimate the target group reliability. 

Edwards and Vevea (2006) examined a subscore augmentation procedure using EB 
adjustments to improve the “overall accuracy” of measurement when information is limited. In a 
situation where tests originally designed for one purpose (e.g., producing a reliable overall score 
to rank examinees) are frequently being pressed into service for other purposes (e.g., providing 
subscores specific to a narrow content area as diagnostic information), an overall score on the 
test may be reliable; however, such a test may not provide reliable diagnostic information (or 
subscores) because the subscores are based on less information than the overall score. Edwards 
and Vevea investigated the feasibility of increasing the reliability of diagnostic subscores by 
incorporating information from the rest of the test. They found that the subscores produced by 
the EB augmentation procedure represented an overall improvement over nonaugmented 
subscores. The magnitude of the improvement gained was a function of the correlation among 
subscales and subscale length (reliability). A main focus of the study by Edwards and Vevea was 
to investigate reliability of diagnostic subscores (not the total test reliability), while for the 
current study the main focus was to estimate total test reliability and standard error of 
measurement (SEM) using subgroup information. 

Bayesian techniques have been applied to differential item functioning (DIF) analysis. 
Zwick and Thayer (2002) applied EB methods to DIF analysis and found the EB estimate of DIF 
to be an improvement over the traditionally used Mantel-Haenszel delta-DIF (MH D-DIF) 
statistic. Sinharay, Dorans, Grant, Blew, and Knorr (2006) investigated a full Bayesian (FB) 
approach to small-sample DIF estimation using the 10 least recent administrations of the 
Praxis: Pre-Professional Skills Test (PPST®). They reported that the FB approach performed 
better than the existing MH D-DIF for small samples, but the gain was not substantial. 
Empirical Bayes techniques have also been proposed to estimate equating relationships 
with small numbers of test takers to improve the “stability and accuracy” of equated scores in the 
target population (Livingston & Lewis, 2009, p. 1). The study proposed to estimate the equated 
scores separately at each score point, incorporating relevant prior information into the estimation 
process. This approach is innovative and has some advantages, but it also has two 
limitations. First, it is not symmetric with respect to the new form and reference form. Second, in 
some situations the proposed procedure can produce a result that is less accurate than that 
provided by the current equating when the difficulty of the current form to be equated is 
significantly different from the difficulties of the forms used in prior equatings. 

One advantage of the EB statistical methods over traditional frequentist methods is that 
the Bayesian methods can incorporate existing collateral (or prior) information into the inference 
problem and lead to improved estimation, especially for small samples. In this context, EB 
methodology involves the simultaneous estimation of parameters in several samples. The 
combined information from other samples is used to adjust the parameter estimate for a given 
sample in order to make it more precise. Empirical Bayes methodology adjusts the estimate more 
or less depending upon the precision of the estimates obtained from the other samples. The 
success of this procedure depends, among other things, on the strength of the relationships 
among the various samples. The primary purpose of this study is to investigate whether the EB- 
based reliability estimates improve the precision for estimating reliability of subgroups of a 
population, even from very small subsamples ranging from 25 to 250, by comparing the 
reliability estimates to Cronbach’s alpha coefficient. The secondary purpose of this study is to 
compare the various EB approaches across different sample sizes. This paper also discusses EB 
estimates of the SEM, their accuracy and precision, and how they compare with traditional SEM 
estimates. 

The following section briefly describes the Bayes and EB approaches. 

The Bayes Approach 

In general, the Bayesian analysis is a methodology to model and simulate the behavior of 
discrete events under uncertainty using past experience, data, or convenient assumptions in the 
form of a prior distribution (Brandel, 2004). Unlike the frequentist approach, probability in 
Bayesian statistics is not defined as the frequency of the occurrence of an event but as the 
plausibility that a statement is true given the information (Botje, 2006). The basic assumption of 
Bayes methodology is that a state of variables to be modeled and simulated can be represented 
by probability functions (discrete variables) or probability density functions (continuous 
variables). 

In the case of the Bayes theorem for discrete variables, consider a set of observed data 
$y = (y_1, \ldots, y_n)$ with some probability distribution $f(y \mid \theta)$ and an associated vector $\theta$ of 
unknown parameters. Suppose that $\theta$ is also a random vector, having a prior distribution 
$g(\theta \mid \eta)$, where $\eta$ is a vector of hyperparameters. Here $\theta$ and $\eta$ are assumed to be discrete 
variables. The Bayes formula is used to compute the posterior distribution $p(\theta \mid y, \eta)$ for 
discrete parameter $\theta$: 

$$p(\theta \mid y, \eta) = \frac{f(y \mid \theta)\, g(\theta \mid \eta)}{\sum_{\theta} f(y \mid \theta)\, g(\theta \mid \eta)}. \tag{1}$$

In the case of the Bayes theorem for continuous variables, let $y = (y_1, \ldots, y_n)$ denote a 
sample from a probability density function indexed by a continuous parameter $\theta$, with prior density 
distribution $g(\theta \mid \eta)$. Then the posterior density distribution for $\theta$ is given by 

$$p(\theta \mid y, \eta) = \frac{f(y \mid \theta)\, g(\theta \mid \eta)}{\int f(y \mid \theta)\, g(\theta \mid \eta)\, d\theta}. \tag{2}$$

If one is unsure of the value of $\eta$, then a proper Bayesian approach would adopt a hyperprior 
distribution $h(\eta)$. Then the posterior distribution for $\theta$ is obtained by integrating the 
conditional density function with respect to $\eta$ as well: 

$$p(\theta \mid y) = \frac{\int f(y \mid \theta)\, g(\theta \mid \eta)\, h(\eta)\, d\eta}{\iint f(y \mid \theta)\, g(\theta \mid \eta)\, h(\eta)\, d\eta\, d\theta}. \tag{3}$$
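The discrete form of the Bayes formula can be illustrated with a short numerical sketch. The three candidate parameter values, the uniform prior, and the likelihood values below are made-up numbers chosen only for illustration; the function name `discrete_posterior` is likewise hypothetical and does not come from the study.

```python
import numpy as np

def discrete_posterior(likelihood, prior):
    """Posterior over a discrete grid of theta values (Bayes formula, discrete case)."""
    likelihood = np.asarray(likelihood, dtype=float)
    prior = np.asarray(prior, dtype=float)
    joint = likelihood * prior       # f(y | theta) * g(theta | eta) for each theta
    return joint / joint.sum()       # normalize by the sum over all theta

# Hypothetical inputs: three candidate theta values with a uniform prior.
prior = np.array([1 / 3, 1 / 3, 1 / 3])
likelihood = np.array([0.10, 0.30, 0.60])   # made-up values of f(y | theta)
posterior = discrete_posterior(likelihood, prior)
```

With a uniform prior the posterior is simply the normalized likelihood, which makes the role of the denominator easy to see.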

The Empirical Bayes Approach 

As an alternative to (3), simply replace $\eta$ with an estimate $\hat{\eta}$ that maximizes the 
marginal distribution $m(y \mid \eta)$: 

$$m(y \mid \eta) = \int f(y \mid \theta)\, g(\theta \mid \eta)\, d\theta. \tag{4}$$

This estimate $\hat{\eta}$ is then used as a known quantity in (2). Consider, for example, the case in 
which $f(y \mid \theta)$ is a normal distribution with mean $\theta$ and known variance $\sigma^2$. Let $g(\theta \mid \eta)$ 
also be a normal distribution with hyperparameters $\eta = (\mu, \tau^2)$. When the hyperparameters 
are known, or when estimates $\hat{\eta}$ are obtained from the data, the posterior 
distribution for $\theta$ is derived as follows: 

$$p(\theta \mid y) = \frac{p(\theta, y)}{p(y)} = \frac{f(y \mid \theta)\, g(\theta)}{p(y)}. \tag{5}$$

Solving the above equation based on the density function of the normal distribution gives 

$$p(\theta \mid y) = \frac{\dfrac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(y-\theta)^2}{2\sigma^2}} \cdot \dfrac{1}{\tau\sqrt{2\pi}}\, e^{-\frac{(\theta-\mu)^2}{2\tau^2}}}{p(y)} \propto \exp\left\{ -\frac{\sigma^2+\tau^2}{2\sigma^2\tau^2} \left[ \theta - \frac{\sigma^2\mu + \tau^2 y}{\sigma^2+\tau^2} \right]^2 \right\}. \tag{6}$$

Therefore, $p(\theta \mid y)$ is a normal distribution with variance $\dfrac{\sigma^2\tau^2}{\sigma^2+\tau^2}$ and mean 
$\dfrac{\sigma^2\mu + \tau^2 y}{\sigma^2+\tau^2}$, derived from equation (6). That is, the posterior distribution for $\theta$ is 

$$p(\theta \mid y) = N\left( \theta \,\middle|\, \frac{\sigma^2\mu + \tau^2 y}{\sigma^2+\tau^2},\; \frac{\sigma^2\tau^2}{\sigma^2+\tau^2} \right). \tag{7}$$

Let $B = \dfrac{\sigma^2}{\sigma^2+\tau^2}$. Then the posterior distribution has mean $B\mu + (1-B)y$ and variance 
$B\tau^2 = (1-B)\sigma^2$. In this sense, $B$ is a weighting factor that increases with $\sigma^2$ 
and decreases with $\tau^2$. 
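The weighting factor B can be made concrete with a minimal sketch of the normal-normal posterior described above. The function name `normal_posterior` and the numeric inputs are hypothetical illustrations, not values from the study.

```python
def normal_posterior(y, sigma2, mu, tau2):
    """Posterior mean and variance of theta for a normal likelihood and normal prior.

    y      : observed estimate, assumed N(theta, sigma2)
    sigma2 : known sampling variance of y
    mu     : prior mean of theta
    tau2   : prior variance of theta
    """
    B = sigma2 / (sigma2 + tau2)        # weight placed on the prior mean
    post_mean = B * mu + (1 - B) * y    # shrinks y toward mu
    post_var = B * tau2                 # equals (1 - B) * sigma2
    return post_mean, post_var, B

# Hypothetical numbers: a noisy estimate of 0.80 and a prior centered at 0.88.
post_mean, post_var, B = normal_posterior(y=0.80, sigma2=0.004, mu=0.88, tau2=0.001)
```

Here the sampling variance is four times the prior variance, so B = 0.8 and the posterior mean sits much closer to the prior mean than to the observed value. A more precise observation (smaller sigma2) would reduce B and leave the estimate nearer to y.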


Methods 

Data 

Data were selected from the Preliminary SAT/National Merit Scholarship Qualifying 

Test (PSAT/NMSQT®) program with a yearly population size of over three million. The 
PSAT/NMSQT measures developed critical reading skills, math problem-solving skills, and 
writing skills, which are related to successful performance in college. The PSAT/NMSQT 
critical reading, math, and writing skills sections are shortened versions of the College Board 
SAT Reasoning Test™ and measure the same abilities. The scores for all three measures are 
reported on a two-digit 20-to-80 scale. To meet the purposes of this study, 19 states and 
Washington, DC, were selected as subgroups by varying population size (e.g., small to large) 
and population reliability (e.g., low to high). Four sample sizes (i.e., N = 25, 50, 125, and 250) 
were examined, and these samples were randomly selected from the population for each 
selected state. 

Two hundred replications (or resampling) of these samples of 25, 50, 125, and 250 
examinees were conducted for each state. The smallest sample was fairly small relative to the 
number of items on the test (48 items for critical reading, 38 items for math, and 39 items for 
writing), whereas the largest sample size of 250 represented a sufficient sample size for 
reliability estimation according to previous research (Walker & Zhang, 2004). Descriptive 
statistics for 19 states and Washington, DC, are displayed in Table 1 for each measure (i.e., 
critical reading, math, and writing). 

Research Design 

Test scores, from which reliability can be computed, were available for each person. To 
aid in estimation, we assumed that the reliability coefficient for each state was randomly sampled 
from some distribution of states’ reliability coefficients with unknown parameter $\theta$, while $\theta$ had 
a distribution with parameter $\eta$. We estimated the parameters of this distribution using the data 
from the states in the study. Using Bayes’ formula and taking this distribution as the collateral 
distribution, we obtained the posterior distribution (3) of the reliability coefficient for any given 
subgroup (i.e., state). The mean of this posterior distribution became the EB reliability estimate 
for the subgroup of interest. In practice, the EB estimate can be seen as the subgroup estimate 
that has shrunk so that it is closer to the estimate of the common parameter across all subgroups. 
The amount of shrinkage depends upon the relative precision of the subgroup and common mean 
estimates. 
Table 1

Comparison of Population Reliabilities and Standard Error of Measurement for 20 States

                     Critical reading          Math                      Writing
                     State         State       State         State       State         State
State     N          reliability   SEM         reliability   SEM         reliability   SEM
AL        24,117     0.883         3.500       0.901         2.869       0.872         3.243
AZ        19,415     0.883         3.529       0.899         2.905       0.871         3.297
AK        11,821     0.881         3.497       0.892         2.883       0.868         3.258
CT        36,180     0.897         3.506       0.913         2.900       0.889         3.283
DE        17,877     0.894         3.511       0.904         2.927       0.884         3.283
DC        11,387     0.911         3.509       0.915         2.868       0.901         3.220
ID         4,262     0.875         3.462       0.884         2.841       0.853         3.264
IA         9,954     0.865         3.443       0.880         2.803       0.837         3.225
KT        20,292     0.879         3.477       0.896         2.859       0.867         3.257
LA        15,695     0.876         3.496       0.890         2.866       0.865         3.227
ME        26,375     0.879         3.509       0.893         2.919       0.869         3.322
MI        29,945     0.878         3.499       0.895         2.884       0.861         3.282
MO         5,891     0.867         3.503       0.881         2.873       0.846         3.292
NE         6,838     0.860         3.508       0.879         2.871       0.839         3.293
ND         2,537     0.852         3.466       0.877         2.815       0.833         3.270
SD         3,217     0.853         3.471       0.863         2.835       0.823         3.258
TN        34,121     0.900         3.482       0.913         2.861       0.889         3.230
WA        30,878     0.898         3.467       0.904         2.894       0.882         3.274
WV         6,160     0.866         3.515       0.885         2.907       0.853         3.248
WY         2,226     0.864         3.506       0.876         2.897       0.841         3.287

Mean                 0.878         3.493       0.892         2.874       0.862         3.266
SD                   0.016         0.022       0.014         0.033       0.021         0.028
Min        2,226     0.852         3.443       0.863         2.803       0.823         3.220
Max       36,180     0.911         3.529       0.915         2.927       0.901         3.322

Note. Statistics in bold are the three states with large, medium, and small population size (i.e., 
Connecticut, Louisiana, and Wyoming) to illustrate how population size affects reliability and 
the standard error of measurement (SEM). 
Criterion for Comparison 

The population reliability coefficient and population SEM for each state were used as 
criteria in this study. 

Empirical Bayes Reliability Estimators 

A total of four EB-based reliability coefficients were estimated in (9) through (13). An 
uncorrected (UC) reliability estimate was determined from Cronbach’s alpha coefficient 
calculated directly from each sample. The mean over the 200 replications was computed for each 
reliability coefficient. Then, the mean of each of the four EB reliability 
estimates and the one UC reliability estimate was compared to the population reliability 
(criterion) for each state. 

Reliability estimates were obtained using UC and EB methods. All estimates were based 
on Cronbach’s alpha coefficient: 


$$\alpha = \frac{k}{k-1} \left( 1 - \frac{\sum_{i=1}^{k} \sigma_i^2}{\sum_{i=1}^{k} \sum_{j=1}^{k} \sigma_{ij}} \right), \tag{8}$$

where $k$ is the number of items in the test form, and $\sigma_i^2$ and $\sigma_{ij}$ are the variance and covariance 
of the item scores, respectively. The estimates were compared to the actual test reliability of the 
population for each state. Because the procedure was repeated on multiple samples (i.e., 200 
replications), standard errors and bias estimates were obtained. The average squared bias, 
average variance, and root mean squared error (RMSE) for the EB and conventional methods 
were compared across sample sizes. Four different EB approaches were tried. 
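Cronbach's alpha, as defined above in terms of item variances and covariances, can be computed from an examinee-by-item score matrix. The following is a minimal sketch; the function name `cronbach_alpha` and the tiny score matrices are illustrative assumptions, not data from the study.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a (persons x items) matrix of item scores."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                      # number of items
    cov = np.cov(scores, rowvar=False)       # k x k item covariance matrix
    item_var_sum = np.trace(cov)             # sum of the item variances
    total_var = cov.sum()                    # sum of all variances and covariances
    return (k / (k - 1)) * (1.0 - item_var_sum / total_var)

# Two perfectly consistent items (made-up scores) give alpha = 1.
alpha_consistent = cronbach_alpha([[1, 1], [2, 2], [3, 3]])

# Two uncorrelated items (made-up scores) give alpha = 0.
alpha_unrelated = cronbach_alpha([[0, 0], [0, 1], [1, 0], [1, 1]])
```

The sum of every entry of the item covariance matrix equals the variance of the total score, which is why the denominator can be taken directly from `cov.sum()`.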

Empirical Bayes Reliability Approach 1. For this approach, EB reliability was estimated 
using the following equation, appropriate for normally distributed estimators: 

$$\alpha_{EB1}(R_i S_j) = \frac{\dfrac{1}{\sigma^2_{R_i S}}\, \mu_{R_i S} + \dfrac{1}{\sigma^2_{R S_j}}\, \alpha_{R_i S_j}}{\dfrac{1}{\sigma^2_{R_i S}} + \dfrac{1}{\sigma^2_{R S_j}}}, \tag{9}$$

where $\sigma^2_{R_i S}$ and $\mu_{R_i S}$ are the variance and mean of the reliability estimates for all states in the 
study for the $i$th replication ($r_i$); $\sigma^2_{R S_j}$ is the variance of State $j$ for all 200 replications; and 
$\alpha_{R_i S_j}$ ($= \alpha_{r_i s_j}$) is the reliability estimate for State $j$ ($s_j$) at the $i$th replication ($r_i$). See Table 2 for an 
overview of the design. 
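The precision-weighted combination used in EB Reliability Approach 1 can be sketched for a full replications-by-states matrix of estimates. This is an illustrative sketch rather than the study's code: the function name `eb_shrink`, the use of the sample variance (`ddof=1`), and the simulated input matrix are all assumptions.

```python
import numpy as np

def eb_shrink(est):
    """Precision-weighted EB estimates for a (replications x states) matrix.

    Each estimate is combined with the mean across states in its replication,
    weighted by the inverse of the across-state variance (per replication) and
    the inverse of the across-replication variance (per state).
    """
    est = np.asarray(est, dtype=float)
    mu_rep = est.mean(axis=1, keepdims=True)             # mean over states, per replication
    var_rep = est.var(axis=1, ddof=1, keepdims=True)     # variance over states, per replication
    var_state = est.var(axis=0, ddof=1, keepdims=True)   # variance over replications, per state
    w_prior = 1.0 / var_rep                              # weight on the replication mean
    w_data = 1.0 / var_state                             # weight on the state's own estimate
    return (w_prior * mu_rep + w_data * est) / (w_prior + w_data)

# Simulated (not real) estimates: 200 replications for 20 states.
rng = np.random.default_rng(0)
simulated = 0.88 + 0.02 * rng.standard_normal((200, 20))
shrunk = eb_shrink(simulated)
```

Because both weights are positive, each EB estimate falls between the state's own estimate and the replication mean. The sketch assumes neither variance is zero; a real implementation would need to guard against that case.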

Empirical Bayes Reliability Approach 2. The reliability estimates $\alpha_{r_i s_j}$ are likely not 
normally distributed. It may, however, be possible to transform them to approximate normality. 
One could, for example, apply Fisher’s z-transformation through 
$z'(\alpha_{r_i s_j}) = \dfrac{1}{2} \ln\left( \dfrac{1 + \alpha_{r_i s_j}}{1 - \alpha_{r_i s_j}} \right)$ (C. 
Lewis, personal communication, September 12, 2006). Once the transformation was carried out, 
(9) was applied and the results were translated back to the original metric via 

$$\alpha_{EB2}(R_i S_j) = \frac{\exp\{2 z'(\alpha_{r_i s_j})\} - 1}{\exp\{2 z'(\alpha_{r_i s_j})\} + 1}. \tag{10}$$


Empirical Bayes (EB) Reliability Approach 3. This approach is very similar to EB 
Reliability Approach 2, except for the treatment of $\alpha_{r_i s_j}$ in the estimation of $z'$. If a reliability 
coefficient is considered to be a squared correlation between observed and true score, then the 
square roots of the reliabilities should be found and then Fisher’s z-transformation should be 
used: 

$$z'(\alpha_{r_i s_j}) = \frac{1}{2} \ln\left( \frac{1 + \sqrt{\alpha_{r_i s_j}}}{1 - \sqrt{\alpha_{r_i s_j}}} \right). \tag{11}$$

Once the transformation was carried out, (9) was applied and the results were translated 
back to the original metric via 

$$\alpha_{EB3}(R_i S_j) = \left[ \frac{\exp\{2 z'(\alpha_{r_i s_j})\} - 1}{\exp\{2 z'(\alpha_{r_i s_j})\} + 1} \right]^2. \tag{12}$$
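The forward and backward transformations used in EB Reliability Approaches 2 and 3 can be sketched as below. The function names are hypothetical; the shrinkage step itself would be applied to the z-values between the forward and backward transforms and is omitted here.

```python
import numpy as np

def fisher_z(r):
    """Fisher's z-transformation of a value in (-1, 1); same as np.arctanh(r)."""
    return 0.5 * np.log((1 + r) / (1 - r))

def inverse_fisher_z(z):
    """Back-transform from the z metric; same as np.tanh(z)."""
    return (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)

def to_z_approach2(alpha):
    return fisher_z(alpha)                  # transform the reliability directly

def from_z_approach2(z):
    return inverse_fisher_z(z)

def to_z_approach3(alpha):
    return fisher_z(np.sqrt(alpha))         # transform the square root of the reliability

def from_z_approach3(z):
    return inverse_fisher_z(z) ** 2         # square to return to the reliability metric

alpha = 0.88                                # illustrative reliability value
roundtrip2 = from_z_approach2(to_z_approach2(alpha))
roundtrip3 = from_z_approach3(to_z_approach3(alpha))
```

Without any shrinkage in between, both round trips recover the original reliability, which is a convenient check that the forward and backward formulas are paired correctly.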


Empirical Bayes (EB) Reliability Approach 4. In EB Reliability Approach 4, the 
reliability was estimated using the EB SEM formula, 

$$\alpha_{EB4}(r_i s_j) = 1 - \frac{SEM^2_{EB1}(r_i s_j)}{SD^2_{r_i s_j}}, \tag{13}$$

where $SEM_{EB1}$ is based on (15) in the next section, and $SD_{R_i S_j}$ ($= SD_{r_i s_j}$) is the standard deviation 
of the score at the $i$th replication ($r_i$) for State $j$ ($s_j$). 


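Table 2

Layout of the Empirical Bayes Approaches

                                                            States
Replications  1                       2                       3                       ...  19                         20                         Mean               Variance
1             $\alpha_{R_1 S_1}$      $\alpha_{R_1 S_2}$      $\alpha_{R_1 S_3}$      ...  $\alpha_{R_1 S_{19}}$      $\alpha_{R_1 S_{20}}$      $\mu_{R_1 S}$      $\sigma^2_{R_1 S}$
2             $\alpha_{R_2 S_1}$      ...                     ...                     ...  ...                        ...                        ...                ...
3             $\alpha_{R_3 S_1}$      ...                     ...                     ...  ...                        ...                        ...                ...
...           ...                     ...                     ...                     ...  ...                        ...                        ...                ...
199           $\alpha_{R_{199} S_1}$  ...                     ...                     ...  ...                        ...                        ...                ...
200           $\alpha_{R_{200} S_1}$  ...                     ...                     ...  ...                        $\alpha_{R_{200} S_{20}}$  $\mu_{R_i S}$      $\sigma^2_{R_i S}$
Mean          $\mu_{R S_1}$           ...                     ...                     ...  ...                        $\mu_{R S_{20}}$
Variance      $\sigma^2_{R S_1}$      ...                     ...                     ...  ...                        $\sigma^2_{R S_{20}}$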
Empirical Bayes Standard Error of Measurement 

Similarly, a total of four EB-based SEM estimates were computed using (15) through (18), 
and the mean over the 200 replications was computed for each. Then, the mean of each of the four EB 
SEM estimates and the one UC SEM estimate was compared to the population SEM 
(criterion) for each state. 

The current study evaluated empirically the accuracy of reliability estimates computed by 
EB approaches, as well as the traditionally used Cronbach’s alpha coefficient (UC reliability 
estimate). While estimating reliability based on the EB approaches is an attractive approach with 
small sample groups, a potential danger exists when one group is large and the others small. In 
this case, the estimates for the small groups will be regressed toward the estimate for the total 
group, which is dominated by the largest group. The resulting values may give the misleading 
impression that the groups are similar because their EB estimates are similar. Another concern is 
that reliability is sensitive to the degree of heterogeneity within a group. In general, the more 
heterogeneous the group, the higher the reliability (Nunnally & Bernstein, 1994, p. 261). Thus it 
may not be reasonable to try to achieve similar reliability from group to group. Instead, a 
quantity less sensitive to heterogeneity/homogeneity, such as the SEM, may be a more 
appropriate parameter to estimate using EB methods (Charles Lewis, personal communication, 
October 6, 2006; Lord & Novick, 1968). Therefore, four different EB approaches for SEM were 
also tried. 

Empirical Bayes standard error of measurement (EB SEM) Approach 1. This approach 
estimated SEM and EB SEM as follows: 

$$SEM_{R_i S_j} = SD_{R_i S_j} \sqrt{1 - \alpha_{R_i S_j}}, \tag{14}$$

where $SD_{R_i S_j}$ is the standard deviation of the score for State $j$ ($s_j$) at the $i$th replication ($r_i$), and 
$\alpha_{R_i S_j}$ is the corresponding reliability: 

$$SEM_{EB1}(R_i S_j) = \frac{\dfrac{1}{\sigma^2_{R_i S}}\, \mu_{R_i S} + \dfrac{1}{\sigma^2_{R S_j}}\, SEM_{R_i S_j}}{\dfrac{1}{\sigma^2_{R_i S}} + \dfrac{1}{\sigma^2_{R S_j}}}, \tag{15}$$

where $\sigma^2_{R_i S}$ and $\mu_{R_i S}$ are the variance and mean of SEM for the states at the $i$th replication ($r_i$), 
respectively; $\sigma^2_{R S_j}$ is the variance of SEM for State $j$ for all 200 replications; and $SEM_{R_i S_j}$ is 
SEM for State $j$ ($s_j$) at the $i$th replication ($r_i$). 

Empirical Bayes standard error of measurement (EB SEM) Approach 2. This approach 
was estimated using the EB-based reliability described in EB Reliability Approach 1. The 
formula for EB SEM Approach 2 is defined as follows: 

$$SEM_{EB2}(r_i s_j) = SD_{r_i s_j} \sqrt{1 - \alpha_{EB1}(r_i s_j)}, \tag{16}$$

where $\alpha_{EB1}$ is based on (9). 

Empirical Bayes standard error of measurement (EB SEM) Approach 3. This approach 
estimated SEM by employing the EB reliability with z-transformation used for the EB Reliability 
Approach 2: 

$$SEM_{EB3}(r_i s_j) = SD_{r_i s_j} \sqrt{1 - \alpha_{EB2}(r_i s_j)}, \tag{17}$$

where $\alpha_{EB2}$ is based on (10). 

Empirical Bayes Standard Error of Measurement (EB SEM) Approach 4. This approach 
estimated SEM by employing the reliability from EB Reliability Approach 3: 

$$SEM_{EB4}(r_i s_j) = SD_{r_i s_j} \sqrt{1 - \alpha_{EB3}(r_i s_j)}, \tag{18}$$

where $\alpha_{EB3}$ is based on (12). 
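The relationship between SEM, score standard deviation, and reliability that underlies these approaches, and its inversion to recover a reliability from an SEM, can be sketched as follows. The function names and the numeric values (a standard deviation of 10 and a reliability of 0.91) are hypothetical and chosen only to give round numbers.

```python
import math

def sem_from_reliability(sd, alpha):
    """SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - alpha)

def reliability_from_sem(sem, sd):
    """Invert the SEM formula: reliability = 1 - SEM^2 / SD^2."""
    return 1.0 - (sem ** 2) / (sd ** 2)

# Hypothetical values: score SD of 10 and reliability of 0.91.
sem = sem_from_reliability(sd=10.0, alpha=0.91)       # about 3.0
alpha_back = reliability_from_sem(sem=sem, sd=10.0)   # recovers about 0.91
```

Plugging an EB-adjusted reliability into the first function gives an SEM in the spirit of EB SEM Approaches 2 through 4, while plugging an EB-adjusted SEM into the second gives a reliability in the spirit of EB Reliability Approach 4.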

Evaluation Indexes 

The current study compared the estimated EB reliabilities and EB SEMs with each state’s 
population reliabilities and SEMs using average squared bias, average variance, and RMSE. 
Average squared bias for reliability is equivalent to the sum of the average squared mean 
differences between the EB-based reliabilities (or uncorrected reliabilities) and population 
reliabilities for the states, divided by the number of states (i.e., 20). That is, 
$$\text{Average Squared Bias} = \frac{1}{20} \sum_{j=1}^{20} \left[ \frac{1}{200} \sum_{i=1}^{200} \left( \hat{\rho}_{i(s_j)} - \rho_{(s_j)} \right) \right]^2, \tag{19}$$

where $\hat{\rho}_{i(s_j)}$ represents the EB-based reliability estimate for the $j$th state ($J$ = 20) and the $i$th 
replication ($I$ = 200), and $\rho_{(s_j)}$ represents the state population reliability for the $j$th state, which is 
regarded as the criterion reliability. Accordingly, a total of 200 resamplings were replicated for the 
states for the analysis. We regard the population reliability as truth for the evaluation. 

Average variance is defined as the sum of the variances of the 200 replications for the 
states, divided by 20 (the number of states): 

$$\text{Average Variance} = \frac{1}{20} \sum_{j=1}^{20} \operatorname{Var}_i\left( \hat{\rho}_{i(s_j)} \right). \tag{20}$$

RMSE is defined as follows: 

$$\text{RMSE} = \sqrt{\text{Average Squared Bias} + \text{Average Variance}}. \tag{21}$$

Average squared bias for SEM is equivalent to the sum of the average squared mean differences 
between the EB-based SEMs (or UC SEMs) and the population SEM for the states, divided by 
the number of states (i.e., 20). The formulas for the average squared bias, average variance, and 
RMSE for SEM are the same as (19), (20), and (21), replacing the reliability estimators with the 
SEM estimators. 
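The three evaluation indexes can be computed together from a replications-by-states matrix of estimates and a vector of population values. This is a sketch under stated assumptions: the function name `evaluation_indexes` is hypothetical, the sample variance (`ddof=1`) is an assumed choice because the divisor is not specified, and the small input arrays are made up.

```python
import numpy as np

def evaluation_indexes(est, truth):
    """Average squared bias, average variance, and RMSE across states.

    est   : (replications x states) array of estimates
    truth : (states,) array of population values used as the criterion
    """
    est = np.asarray(est, dtype=float)
    truth = np.asarray(truth, dtype=float)
    sq_bias = (est.mean(axis=0) - truth) ** 2   # squared mean difference, per state
    variance = est.var(axis=0, ddof=1)          # variance over replications, per state
    avg_sq_bias = sq_bias.mean()                # average over states
    avg_variance = variance.mean()              # average over states
    rmse = np.sqrt(avg_sq_bias + avg_variance)
    return avg_sq_bias, avg_variance, rmse

# Made-up example: 2 replications for 2 states.
est = np.array([[0.80, 0.90],
                [0.90, 1.00]])
truth = np.array([0.85, 0.90])
avg_sq_bias, avg_variance, rmse = evaluation_indexes(est, truth)
```

The same function applies to SEM estimates by passing SEM values and population SEMs in place of the reliabilities.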


Procedure 

The current study was conducted using the following steps: 

1. Selected 20 states out of 50 states by varying population size and reliability from 
small to large. 

2. Calculated population reliability and SEM for the 20 states to use as a criterion. 

3. Performed random sampling of four sample sizes of 25, 50, 125, and 250 from the 
population for each of the states. 
4. Conducted resamplings 200 times for each sample size of 25, 50, 125, and 250 for 
each of the states. 

5. Calculated UC reliability and UC SEM for each sample. 

6. Calculated four EB reliabilities as in (9) to (13) and EB SEMs as in (15) to (18) for 
each sample. 

7. Compared the UC and EB reliabilities and SEMs to the population reliability and 
SEM across replications for each sample size and state using the evaluation indexes 
(i.e., average squared bias, average variance, and RMSE). 

Results 

Empirical Bayes Reliability Estimators 

Table 1 displays population reliability and SEM for the three measures (critical reading, 
math, and writing) for all 20 states. The mean of the 20 states’ reliabilities for critical reading is 
0.88 and ranges from 0.85 to 0.91; for math, the mean is 0.89 and ranges from 0.86 to 0.92; and 
for writing, the mean is 0.86 and ranges from 0.82 to 0.90. The mean SEMs for all 20 states are 
3.49, 2.87, and 3.27 for critical reading, math, and writing, respectively. The SEM for reading 
ranges from 3.44 to 3.53; for math, from 2.80 to 2.93; and for writing, from 3.22 to 3.32. The 
reliability coefficient for each state for each measure is moderately high; the SEMs also look 
reasonably small. Overall, math reliabilities are slightly higher and SEMs are slightly lower than 
those of critical reading and writing. Statistics in bold font in Table 1 are for the three states 
with large, medium, and small population size (i.e., Connecticut, Louisiana, and Wyoming) to 
illustrate how population size affects reliability and SEM. As shown, the reliability of the state 
with large population size tends to be higher than that of the state with small population size. 
Table 2 displays a layout of the empirical Bayes approaches for 200 replications across 20 states. 

Table 3 presents a comparison of average squared bias, average variance, and RMSE for 
uncorrected (UC) and empirical Bayes (EB) reliability estimators. The shaded areas in the table 
indicate the smallest indexes within each test and each sample size. As expected, as sample sizes 
are increased from 25 to 250, the extent of average squared bias, average variance, and RMSE 
are reduced. 
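As a quick arithmetic check on how the columns of Table 3 relate to one another, note that the bias and variance columns are reported multiplied by 100 while RMSE is not. Taking the uncorrected estimator for reading at N = 50 (average squared bias × 100 = 0.0021, average variance × 100 = 0.0635, RMSE = 0.0256), the RMSE definition gives:

```latex
\mathrm{RMSE} = \sqrt{\frac{0.0021 + 0.0635}{100}} = \sqrt{0.000656} \approx 0.0256
```

The tabled RMSE therefore agrees with the tabled bias and variance once the scaling factor is removed.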

Table 3

Comparison of Statistics for Uncorrected and Empirical Bayes Reliability Estimators

                               (Avg. squared bias) × 100     (Avg. variance) × 100         RMSE
Reliability coefficient        Reading   Math     Writing    Reading   Math     Writing    Reading   Math     Writing

N = 25
  Uncorrected                  0.0106    0.0071   0.0101     0.1664    0.1251   0.1824     0.0421    0.0363   0.0439
  Empirical Bayes              0.0134    0.0090   0.0154     0.0596    0.0454   0.0691     0.0270    0.0233   0.0291
  EB with z-transform          0.0089    0.0155   0.0096     0.0529    0.0386   0.0532     0.0249    0.0233   0.0251
  EB with z-transform (sqrt)   0.0089    0.0155   0.0097     0.0531    0.0387   0.0533     0.0249    0.0233   0.0251
  Based on EB SEM              0.0103    0.0072   0.0096     0.1569    0.1184   0.1698     0.0409    0.0354   0.0424

N = 50
  Uncorrected                  0.0021    0.0015   0.0022     0.0635    0.0473   0.0751     0.0256    0.0221   0.0278
  Empirical Bayes              0.0046    0.0037   0.0066     0.0263    0.0192   0.0335     0.0176    0.0151   0.0200
  EB with z-transform          0.0041    0.0077   0.0085     0.0257    0.0193   0.0271     0.0172    0.0165   0.0189
  EB with z-transform (sqrt)   0.0041    0.0077   0.0086     0.0257    0.0193   0.0271     0.0172    0.0165   0.0189
  Based on EB SEM              0.0021    0.0018   0.0021     0.0596    0.0448   0.0701     0.0249    0.0216   0.0269

N = 125
  Uncorrected                  0.0005    0.0003   0.0007     0.0229    0.0174   0.0284     0.0153    0.0133   0.0171
  Empirical Bayes              0.0023    0.0022   0.0034     0.0118    0.0088   0.0161     0.0119    0.0105   0.0140
  EB with z-transform          0.0023    0.0042   0.0063     0.0119    0.0094   0.0136     0.0119    0.0117   0.0141
  EB with z-transform (sqrt)   0.0023    0.0042   0.0063     0.0119    0.0094   0.0136     0.0119    0.0117   0.0141
  Based on EB SEM              0.0005    0.0004   0.0007     0.0216    0.0165   0.0265     0.0149    0.0130   0.0165

N = 250
  Uncorrected                  0.0001    0.0001   0.0001     0.0109    0.0082   0.0131     0.0105    0.0091   0.0115
  Empirical Bayes              0.0012    0.0010   0.0015     0.0069    0.0052   0.0091     0.0090    0.0078   0.0103
  EB with z-transform          0.0012    0.0017   0.0036     0.0070    0.0055   0.0080     0.0091    0.0084   0.0108
  EB with z-transform (sqrt)   0.0012    0.0017   0.0036     0.0070    0.0055   0.0080     0.0091    0.0084   0.0108
  Based on EB SEM              0.0002    0.0001   0.0002     0.0104    0.0079   0.0124     0.0103    0.0090   0.0112

Note. Shaded areas in the table indicate the smallest indexes within each test and each sample size. EB SEM = empirical Bayes 
standard error of measurement, RMSE = root mean squared error, sqrt = square root. 

For the smallest sample size of 25, EB Reliability Approach 2 (using z-transformation) 
produced the smallest bias for reading and writing, and the UC reliability estimate yielded the
smallest bias for math. For a sample size of 50, the UC reliability estimates produced the smallest
average squared bias for critical reading and math, and EB Reliability Approach 4 (using EB
SEM) yielded the smallest bias for writing. For sample sizes of 125 and 250, the UC
reliability estimates produced the smallest average squared bias for all three measures. The
results showed that when the sample size is very small (i.e., N = 25), EB-based reliability
estimates produced smaller bias than the UC reliability estimates for critical reading and
writing. In terms of average variance and RMSE, EB reliability estimates (either EB Reliability
Approach 1 or Approach 2) produced relatively small average variance and RMSE for all
measures. This suggests that EB reliability estimates performed better than the traditional UC
reliability estimates in terms of stability. As shown in Table 3, EB estimates with
z-transformation (EB Reliability Approach 2) and EB estimates with z-transformation sqrt (EB
Reliability Approach 3) are nearly identical.
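
The three indexes compared above can be computed from replicated estimates as in the following
minimal sketch. It assumes that bias for each group (e.g., a state) is the mean estimate across
replications minus the population value, that squared biases and variances are averaged across
groups, and that RMSE is the square root of their sum. The function name, the averaging order, and
the toy data are illustrative assumptions, not details taken from the study.

```python
import math
from statistics import mean, pvariance

def summarize(replicates_by_group, truth_by_group):
    """Return (average squared bias, average variance, RMSE) across groups.

    replicates_by_group: dict mapping a group to its list of replicated estimates.
    truth_by_group: dict mapping the same groups to the population value.
    """
    sq_biases, variances = [], []
    for group, estimates in replicates_by_group.items():
        bias = mean(estimates) - truth_by_group[group]
        sq_biases.append(bias ** 2)
        variances.append(pvariance(estimates))  # variance across replications
    avg_sq_bias = mean(sq_biases)
    avg_variance = mean(variances)
    rmse = math.sqrt(avg_sq_bias + avg_variance)
    return avg_sq_bias, avg_variance, rmse

# Toy data (invented for illustration): two groups, four replications each.
reps = {"A": [0.90, 0.92, 0.88, 0.90], "B": [0.85, 0.87, 0.83, 0.89]}
truth = {"A": 0.91, "B": 0.86}
avg_sq_bias, avg_variance, rmse = summarize(reps, truth)
```

Under this definition the tabled values are internally consistent: for N = 125, reading,
uncorrected, the square root of (0.0005 + 0.0229)/100 is approximately 0.0153, which matches the
reported RMSE.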

Figures 1 through 6 depict the distributions of the UC and EB reliability estimates across 200
replications for critical reading (Figures 1 and 2), math (Figures 3 and 4), and writing (Figures
5 and 6). The top panel of Figure 1 shows the distributions of the reliability estimates for a
state with a large population and a state with a small population at sample size 25. Connecticut
(N = 36,180) and Wyoming (N = 2,226) were selected as the large and small states, respectively,
for illustration purposes. The next panel in Figure 1 is for sample size 50, and Figure 2 covers
sample sizes 125 and 250. In each plot, four distributions are depicted: UC reliability (solid
line), EB reliability (dashed line), EB reliability with z-transformation (dashed-dotted line),
and EB reliability using EB SEM (dotted line). Since EB estimates in Approach 2 and Approach 3
are nearly identical, Approach 3 was dropped from the figures. The criterion (i.e., the population
reliability) is depicted with a vertical line. The horizontal axis shows the reliability, and the
vertical axis shows the relative frequency (in percent) of the 200 replications whose estimates
fell at a given level. As shown in Figures 1 and 2, the sample size used to estimate the
reliability affects each distribution of reliability estimates. As sample sizes increase from 25
to 250, the shapes of the distributions of the UC and EB reliability estimates become more similar
to each other, and their means approach the criterion (the vertical line), which is the population
reliability value. The distribution shapes of the UC reliability estimates (solid line) and the EB
reliability estimates using the EB SEM approach (dotted line) are more similar to each other than
to those of the other EB reliability estimates. Their means are also close to the population
reliability for all sample sizes, with a few exceptions (e.g., sample size 25 for Wyoming),
indicating low bias. It appears that the population size of each state appreciably affects the
distribution shape of the UC reliability estimates and the EB reliability estimates using the SEM
approach only when the sample sizes are very small. The distribution shapes of the UC reliability
estimates (solid line) and the EB reliability estimates using the EB SEM approach (dotted line)
are more skewed than those of the other EB estimates for the state with a small population
(Wyoming), particularly for a sample size of 25. Overall, EB Reliability Approach 1 (dashed line)
performs better than the UC reliability and the other EB reliability estimates in terms of
accuracy and stability, particularly when both the sample and population sizes are very small
(i.e., sample size = 25 for Wyoming).

Figure 1. Frequency distribution of uncorrected and empirical Bayes reliability estimators from the resampling (N = 200) for
reading, sample sizes 25 and 50. Panels: Connecticut (N = 36,180) and Wyoming (N = 2,226).

Figure 2. Frequency distribution of uncorrected and empirical Bayes reliability estimators from the resampling (N = 200) for
reading, sample sizes 125 and 250. Panels: Connecticut (N = 36,180) and Wyoming (N = 2,226).

Figure 3. Frequency distribution of uncorrected and empirical Bayes reliability estimators from the resampling (N = 200) for
mathematics, sample sizes 25 and 50.

Figure 4. Frequency distribution of uncorrected and empirical Bayes reliability estimators from the resampling (N = 200) for
mathematics, sample sizes 125 and 250.

Figure 5. Frequency distribution of uncorrected and empirical Bayes reliability estimators from the resampling (N = 200) for
writing, sample sizes 25 and 50.

Figure 6. Frequency distribution of uncorrected and empirical Bayes reliability estimators from the resampling (N = 200) for
writing, sample sizes 125 and 250.

Note (Figures 1 through 6). UC Rel = uncorrected reliability, EB Rel = empirical Bayes reliability, zEB Rel = empirical Bayes
reliability with z-transformation, EBsem Rel = empirical Bayes reliability using standard error of measurement.
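
The quantity on the vertical axis of these plots, the percentage of replications falling at each
reliability level, can be sketched as follows. The bin range, the number of bins, and the function
name are illustrative assumptions; this section does not state how the estimates were grouped for
plotting.

```python
def relative_frequency(estimates, lower, upper, n_bins):
    """Percent of all estimates falling in each of n_bins equal-width bins on [lower, upper].

    Estimates outside [lower, upper] are not binned but still count toward the total.
    """
    width = (upper - lower) / n_bins
    counts = [0] * n_bins
    for value in estimates:
        if lower <= value <= upper:
            # A value equal to `upper` is placed in the last bin.
            index = min(int((value - lower) / width), n_bins - 1)
            counts[index] += 1
    total = len(estimates)
    return [100.0 * c / total for c in counts]

# Invented estimates standing in for replicated reliability values.
sample = [0.81, 0.84, 0.86, 0.88, 0.88, 0.89, 0.91, 0.92, 0.93, 0.96]
percents = relative_frequency(sample, 0.80, 1.00, 4)
```

Plotting one such list per estimator on a common horizontal axis gives overlaid distributions of
the kind described above.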

Figures 3 and 4 show the comparison of UC and EB reliability distributions for math. The
distribution shapes and patterns are similar to those in the reading plots in Figures 1 and
2. As sample sizes increase, the shapes of the distributions of the UC and EB reliability estimates
converge and their means approach the population reliability. The distribution shapes of the UC
reliability estimates (solid line) and the EB reliability estimates using the EB SEM approach
(dotted line) are more similar to each other than to those of the other EB reliability estimates.
Also, their means are close to the population reliability for almost all sample sizes. When both
the sample and population sizes are small (i.e., sample size = 25 for Wyoming), EB Reliability
Approach 1 (dashed line) performs better than the UC reliability and the other EB reliability
estimates in terms of accuracy and stability.

Figures 5 and 6 display a comparison of UC and EB reliability distributions for writing.
The distribution shapes and patterns are similar to the findings for critical reading and math.
The shapes of the distributions of the UC and EB reliability estimates become more similar to
each other, and their means move closer to the population reliability, as sample sizes increase.
When both sample and population sizes are small, EB Reliability Approach 1 (dashed line)
performs better than the UC and other EB reliability estimates.

In summary, EB-based approaches (either EB Reliability Approach 1 or EB Reliability
Approach 2) produced relatively small average variances and RMSEs compared to the UC
reliability estimates for all three measures. In terms of average squared bias, the pattern
depends on sample size: EB reliability estimates produce smaller average squared bias than UC
reliability estimates for reading and writing when the sample size is very small (N = 25),
whereas UC reliability estimates produce smaller average squared bias than EB reliability
estimates for all three measures when the sample size is greater than 25, with a few exceptions
in writing.

Among the four EB reliability approaches, Reliability Approach 2 and Reliability
Approach 3 are nearly identical for average squared bias, average variance, and RMSE for all
three measures across different sample sizes (see Table 3). Interestingly, EB Reliability
Approach 4 is more similar to the UC approach than to the other EB reliability approaches in
terms of average squared bias, average variance, and RMSE. As shown in Figures 1 through 6,
when both the sample and population sizes are very small (N = 25 for Wyoming), EB Reliability
Approach 1 (dashed line) performs better than the UC and other EB estimates.

Empirical Bayes Standard Error of Measurement Estimators 

Table 4 displays a comparison of average squared bias, average variance, and RMSE for
UC versus EB SEM estimators (see Note 5). As observed in the reliability results, as sample sizes
increased, the average squared bias, average variance, and RMSE of the SEM indexes decreased.
The SEM results are very consistent across different sample sizes and measures. Uncorrected
SEM estimates produce the smallest average squared bias for all three measures across different
sample sizes. As observed in the reliability results, a sample size of 250 yields the smallest bias.
For average variance and RMSE, EB SEM Approach 1 (EB methodology applied to the SEM
computed from the uncorrected reliability index) produces the smallest values for all three
measures across different sample sizes.

Table 4 shows that the estimates produced by EB SEM Approach 3 (EB reliability with
z-transformation) and EB SEM Approach 4 (EB reliability with z-transformation sqrt) are virtually
identical for average squared bias, average variance, and RMSE for all three measures across
different sample sizes. The difference between EB SEM Approach 3 and EB SEM Approach 4 is
that Approach 3 computed SEM using the EB reliability with Fisher’s z-transformation, while
Approach 4 employed square roots of the EB reliabilities and then used the z-transformation.
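
The transformations that distinguish EB SEM Approach 3 from Approach 4 can be sketched as
follows. The empirical Bayes step itself is omitted; the sketch only shows Fisher's
z-transformation applied to a reliability and to its square root, the back-transformations, and
the classical SEM formula. Squaring on the way back in the second route, the function names, and
the numeric values are assumptions made for illustration.

```python
import math

def fisher_z(r):
    """Fisher's z-transformation of a correlation-type coefficient, -1 < r < 1."""
    return math.atanh(r)

def inverse_fisher_z(z):
    """Back-transform from the z scale to the correlation scale."""
    return math.tanh(z)

def sem_from_reliability(sd, reliability):
    """Classical test theory SEM: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1.0 - reliability)

reliability = 0.90  # hypothetical subgroup reliability
sd = 100.0          # hypothetical observed-score standard deviation

# Route A: transform the reliability itself.
z_direct = fisher_z(reliability)
back_direct = inverse_fisher_z(z_direct)

# Route B: transform the square root of the reliability, then square after back-transforming.
z_sqrt = fisher_z(math.sqrt(reliability))
back_sqrt = inverse_fisher_z(z_sqrt) ** 2

sem = sem_from_reliability(sd, back_direct)
```

Without any shrinkage on the z scale, both routes return the original reliability, so any
difference between the two approaches arises only from where the shrinkage is applied.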

Table 4

Comparison of Statistics for Uncorrected and Empirical Bayes Standard Error of Measurement Estimators

                                                    (Avg. squared bias) x 100 (Avg. variance) x 100     RMSE
SEM coefficient                                     Reading Math    Writing   Reading Math    Writing   Reading Math    Writing

N = 25
  Uncorrected                                       0.0032  0.0019  0.0022    0.6226  0.5434  0.6223    0.0791  0.0738  0.0790
  Empirical Bayes SEM                               0.0125  0.0247  0.0164    0.2006  0.1845  0.2065    0.0462  0.0457  0.0472
  Based on EB reliability                           1.4567  0.8733  1.4441    4.5296  3.0007  3.0202    0.2446  0.1968  0.2113
  Based on EB reliability with z-transform          1.1286  1.6020  1.6965    3.9668  3.1806  2.8176    0.2257  0.2187  0.2125
  Based on EB reliability with z-transform (sqrt)   1.1289  1.6026  1.6971    3.9645  3.1766  2.8186    0.2257  0.2186  0.2125

N = 50
  Uncorrected                                       0.0019  0.0014  0.0010    0.3071  0.2564  0.2912    0.0556  0.0508  0.0541
  Empirical Bayes SEM                               0.0113  0.0201  0.0132    0.1037  0.1000  0.1086    0.0339  0.0347  0.0349
  Based on EB reliability                           1.0821  0.6296  0.9791    1.4869  0.9860  0.9630    0.1603  0.1271  0.1394
  Based on EB reliability with z-transform          1.0753  1.1816  1.4711    1.3699  1.0790  1.0823    0.1564  0.1504  0.1598
  Based on EB reliability with z-transform (sqrt)   1.0751  1.1809  1.4744    1.3696  1.0781  1.0830    0.1564  0.1503  0.1599

N = 125
  Uncorrected                                       0.0009  0.0005  0.0006    0.1142  0.1081  0.1209    0.0339  0.0330  0.0349
  Empirical Bayes SEM                               0.0086  0.0124  0.0115    0.0453  0.0535  0.0522    0.0232  0.0257  0.0252
  Based on EB reliability                           0.5704  0.3865  0.5006    0.3245  0.2358  0.2035    0.0946  0.0789  0.0839
  Based on EB reliability with z-transform          0.6104  0.7562  0.9399    0.3060  0.2534  0.2537    0.0957  0.1005  0.1093
  Based on EB reliability with z-transform (sqrt)   0.6102  0.7554  0.9424    0.3060  0.2532  0.2540    0.0957  0.1004  0.1094

N = 250
  Uncorrected                                       0.0003  0.0001  0.0002    0.0582  0.0512  0.0578    0.0242  0.0227  0.0241
  Empirical Bayes SEM                               0.0071  0.0070  0.0070    0.0263  0.0313  0.0306    0.0183  0.0196  0.0194
  Based on EB reliability                           0.2692  0.1708  0.2019    0.0926  0.0677  0.0569    0.0602  0.0488  0.0509
  Based on EB reliability with z-transform          0.2917  0.3183  0.4632    0.0882  0.0733  0.0731    0.0616  0.0626  0.0732
  Based on EB reliability with z-transform (sqrt)   0.2916  0.3178  0.4648    0.0882  0.0733  0.0731    0.0616  0.0625  0.0733

Note. Shaded areas in the table indicate the smallest indexes within each test and each sample size. EB SEM = empirical Bayes
standard error of measurement, RMSE = root mean squared error, sqrt = square root.

Figures 7 through 12 illustrate the distributions of the UC and EB SEM estimates across 200
replications for critical reading (Figures 7 and 8), math (Figures 9 and 10), and writing (Figures
11 and 12). The top panel of Figure 7 shows the SEM distributions for states with large and small
population sizes at sample size 25. The next panel is for sample size 50, and so on. As shown in
the figures, the sample size used to estimate the SEM affects the SEM distributions. As sample
sizes increase from 25 to 250, the shapes of the distributions of the UC and EB SEM estimates
become more similar to each other and their means approach the criterion (the population SEM,
indicated by the dotted vertical line). The distributions of the UC SEM and EB SEM Approach 1
are the most similar in shape among the SEM distributions. The means of the UC SEM and EB SEM
Approach 1 are also close to the population SEM for all sample sizes. Population size does not
appear to affect appreciably the distribution shape of the UC and EB SEM estimates for large
sample sizes, but it seems to affect the distributions of the EB SEM approaches at the small
sample size of 25, particularly for EB SEM Approach 2 (dotted-dashed line) and EB SEM
Approach 3 (dotted line), for all three measures.

Figure 7. Frequency distribution of uncorrected and empirical Bayes standard error of measurement estimators from the
resampling (N = 200) for reading, sample sizes 25 and 50.

Figure 8. Frequency distribution of uncorrected and empirical Bayes standard error of measurement estimators from the
resampling (N = 200) for reading, sample sizes 125 and 250.

Figure 9. Frequency distribution of uncorrected and empirical Bayes standard error of measurement estimators from the
resampling (N = 200) for mathematics, sample sizes 25 and 50.

Figure 10. Frequency distribution of uncorrected and empirical Bayes standard error of measurement estimators from the
resampling (N = 200) for mathematics, sample sizes 125 and 250.

Figure 11. Frequency distribution of uncorrected and empirical Bayes standard error of measurement estimators from the
resampling (N = 200) for writing, sample sizes 25 and 50.

Figure 12. Frequency distribution of uncorrected and empirical Bayes standard error of measurement estimators from the
resampling (N = 200) for writing, sample sizes 125 and 250.

Note (Figures 7 through 12). UC SEM = uncorrected standard error of measurement, EB SEM = empirical Bayes standard error of
measurement, EBsem Rel = empirical Bayes reliability using standard error of measurement, EB zRel SEM = empirical Bayes
reliability with z-transformation using standard error of measurement.

In summary, UC SEM estimates produce smaller average squared bias than EB SEM
estimates, while EB SEM Approach 1 produces relatively small average variances and RMSEs
compared to the UC SEM estimates for all three measures. For all three measures across
different sample sizes, EB SEM Approach 3 and EB SEM Approach 4 are nearly identical for
average squared bias, average variance, and RMSE (see Table 4). As displayed in Figures 7
through 12, when sample sizes are small (N = 25 or N = 50), EB SEM Approach 1 performs better
than the UC and other EB SEM approaches.
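
The trade-off between bias and variance can be checked against the reading values at sample size
25 in Table 4. The check assumes that RMSE is the square root of the sum of the average squared
bias and the average variance, with both divided by 100 to undo the scaling. That relation is an
assumption, but it reproduces the tabled RMSE values:

```latex
\begin{align*}
\text{UC SEM:}\quad
  & \sqrt{\frac{0.0032 + 0.6226}{100}} = \sqrt{0.006258} \approx 0.0791 \\
\text{EB SEM (Approach 1):}\quad
  & \sqrt{\frac{0.0125 + 0.2006}{100}} = \sqrt{0.002131} \approx 0.0462
\end{align*}
```

Approach 1 has roughly four times the squared bias of the uncorrected estimate, but its variance
is about one third as large, so its RMSE is lower.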


Discussion 

In the current study, an EB procedure was evaluated for estimating the reliability of
subgroups of a population, including very small subgroups. The aim was to improve the
precision and accuracy of reliability estimation by integrating collateral information from the
reliability of other subgroups (i.e., states). The Bayesian estimates were compared to the
traditionally and currently used Cronbach’s alpha coefficient in terms of average squared
bias, average variance, and RMSE.
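
For reference, the uncorrected coefficient used as the baseline, Cronbach's alpha, can be computed
as in the sketch below. It uses the standard formula based on item variances and total-score
variance. The function name and the small response matrix are invented for illustration and are
not taken from the study data.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of examinee rows, one score per item."""
    n_items = len(item_scores[0])
    # Sample variance of each item across examinees.
    item_vars = [variance([row[j] for row in item_scores]) for j in range(n_items)]
    # Sample variance of the total score across examinees.
    total_var = variance([sum(row) for row in item_scores])
    return (n_items / (n_items - 1)) * (1.0 - sum(item_vars) / total_var)

# Invented responses: 5 examinees by 3 items.
scores = [
    [1, 1, 1],
    [2, 2, 1],
    [3, 2, 3],
    [4, 4, 3],
    [5, 4, 5],
]
alpha = cronbach_alpha(scores)
```

With only a handful of examinees, the item and total variances are themselves unstable, which is
the small-sample imprecision that motivates borrowing information from other subgroups.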

The general findings for both reliability and SEM estimates from the current study are
that the EB-based approaches produced greater bias but smaller variance and RMSE than the UC
estimates, with a few exceptions. Sample size has a sizable impact on both the EB and UC results
in terms of bias and variance, whereas population size has some impact on the distribution of EB
and UC estimates only when the sample size is small.

More specifically, as sample sizes increase, the average squared bias, average
variance, and RMSE decrease, the shapes of the distributions of the UC and EB estimates become
more similar to each other, and the means of the UC and EB estimates move closer to the criterion.

With regard to population size, the population size of each state does not appreciably affect
the distribution shape of the reliability and SEM estimates at the larger sample sizes; it
appears to affect the distributions only at the small sample size of 25. In particular, the
distributions for UC reliability and EB Reliability Approach 4, as well as for EB SEM Approach 2
and EB SEM Approach 3, depart from a normal distribution when both population size and sample
size are small.

For the comparison of UC and EB reliability estimates, the EB reliability estimates usually
produce greater average squared bias than the UC reliability estimates, with a few
exceptions in writing. However, absolute differences in the average squared bias (x 100) between
UC and EB reliability estimates are very small even at the small sample size of 25. Differences
ranged from 0.0002 to 0.0028 for reading, from 0.0016 to 0.0049 for math, and from 0.0010 to
0.0048 for writing. In particular, the absolute differences between UC reliability and EB
Reliability Approach 4 (based on EB SEM) are almost negligible. In addition, EB Reliability
Approach 1 and EB Reliability Approach 2 produce relatively small average variances and
RMSEs compared to the UC reliability for all three measures across different sample sizes.
These results indicate that EB-based reliability estimation seems promising even with small
sample sizes, particularly in terms of average variance and RMSE.

Although estimating reliability with the EB approaches is attractive for small sample
groups, the EB estimates for the small groups will be regressed toward the estimate for the large
group when one group is large and the others are small. Another concern is that reliability is
sensitive to the degree of heterogeneity within a group. Therefore, we also calculated the SEM
estimate, which is a quantity less sensitive to heterogeneity/homogeneity.

For the comparison of the UC and EB SEM estimates, the pattern of results is similar to
that found in the reliability analysis, but the SEM results are more consistent
across different sample sizes and measures than the reliability results. The EB
SEM estimates produce relatively larger average squared bias than the UC SEM estimates for
all three measures across the four sample sizes. EB SEM Approach 1 produced small average
variances and RMSEs for all three measures across sample sizes. The results of EB SEM
Approach 3 and EB SEM Approach 4 were nearly identical for average squared bias, average
variance, and RMSE for all three tests across different sample sizes.

As shown in Table 1, the data used for the current study were reasonably stable and
did not include any extremely small or large values in population size, reliability coefficients,
or SEMs. From a practical point of view, the results of the current study provide some support for
EB-based reliability estimation using collateral information, even with very small sample sizes of
25, in terms of average squared bias, average variance, and RMSE.

The current study leaves a number of issues to be considered, even though the EB-based
approaches seem to have some benefits that traditional reliability estimation methods do
not have. As is commonly recognized, EB estimation works better when more groups are used,
because the between-groups variability must be estimated from the empirical group information;
for this reason, the current study used 20 states. In practice, however, some situations may not
have enough subgroups (e.g., gender groups) and, consequently, not enough collateral information.
In other situations, one group is large (e.g., the White group among ethnic subgroups) but another
group is small (e.g., the American Indian group). In such cases, EB-based reliability estimation
cannot be expected to function very well because the estimates for small groups will be regressed
toward the estimates for the large group. That is, the effectiveness of the EB-based approaches
depends greatly on the characteristics of the collateral information. Therefore, prudent
judgment is required to select adequate collateral information to obtain precise and accurate
reliability estimates.
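
The regression of small-group estimates toward a large group can be illustrated with a generic
normal-normal empirical Bayes shrinkage estimator. The sketch below is not the estimator used in
this study. The precision-weighted prior mean, the method-of-moments between-group variance, and
all names and numbers are assumptions chosen for illustration.

```python
def eb_shrink(estimates, variances):
    """Shrink group estimates toward a precision-weighted mean (illustrative only).

    estimates: observed group estimates (e.g., subgroup reliabilities).
    variances: sampling variance of each estimate (small for large groups).
    """
    k = len(estimates)
    # Precision-weighted prior mean: groups with small sampling variance dominate.
    weights = [1.0 / v for v in variances]
    mu = sum(w * y for w, y in zip(weights, estimates)) / sum(weights)
    # Method-of-moments between-group variance, truncated at zero.
    mean_y = sum(estimates) / k
    s2 = sum((y - mean_y) ** 2 for y in estimates) / (k - 1)
    tau2 = max(0.0, s2 - sum(variances) / k)
    shrunk = []
    for y, v in zip(estimates, variances):
        b = v / (v + tau2)  # shrinkage factor: near 1 for noisy (small) groups
        shrunk.append(b * mu + (1.0 - b) * y)
    return mu, tau2, shrunk

# One large group (tiny sampling variance) and two small groups; values are invented.
observed = [0.90, 0.80, 0.95]
sampling_var = [0.0001, 0.004, 0.004]
mu, tau2, shrunk = eb_shrink(observed, sampling_var)
```

Because the prior mean is dominated by the large group, both small-group estimates move toward
about 0.90 while the large group's estimate barely changes, which is the behavior described above.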

Notes

1 The continuous probability density function of the normal distribution is defined as

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right).$$

2 In the interests of conciseness, this paper will refer to both the 19 states in the study and
Washington, DC, generally as states.

3 Nunnally and Bernstein (1994) made the argument by referring to the formula from classical
test theory for the reliability coefficient: $\rho_{xx'} = 1 - \sigma_e^2 / \sigma_x^2$. If the
error variance remains fairly constant across groups (an assumption of many derivations from
classical test theory), then the size of the reliability coefficient is completely determined by
the variance of X.

4 Because average squared bias and average variance of UC and EB estimates were very small in
numerical value (e.g., 0.0001065 for reading average squared bias in sample size 25), those
indexes were multiplied by 100.

5 Because average squared bias and average variance of UC and EB estimates were very small in
numerical value (e.g., 0.000032 for reading average squared bias in sample size 25), those
indexes were multiplied by 100.